Concurrent PAC RL

Authors

  • Zhaohan Guo
  • Emma Brunskill
Abstract

In many real-world situations a decision maker may make decisions across many separate reinforcement learning tasks in parallel, yet there has been very little work on concurrent RL. Building on the efficient-exploration RL literature, we introduce two new concurrent RL algorithms and bound their sample complexity. We show that under some mild conditions, both when the agent is known to be acting in many copies of the same MDP, and when the tasks are not identical but are drawn from a finite set, we can gain linear improvements in sample complexity over not sharing information. This is quite exciting, as a linear speedup is the most one might hope to gain. Our preliminary experiments confirm this result and show empirical benefits.

The ability to share information across tasks to speed learning is a critical aspect of intelligence, and an important goal for autonomous agents. These tasks may themselves involve a sequence of stochastic decisions: consider an online store interacting with many potential customers, a doctor treating many diabetes patients, or tutoring software teaching algebra to a classroom of students. Here each task (customer relationship management, patient treatment, student tutoring) can be modeled as a reinforcement learning (RL) problem, with one decision maker performing many tasks in parallel. In such cases there is an opportunity to improve outcomes for all tasks (customers, patients, students) by leveraging shared information across the tasks. Interestingly, despite these compelling applications, there has been almost no work on concurrent reinforcement learning. There have been a number of papers (e.g. Evgeniou and Pontil 2004; Xue et al. 2007) on supervised concurrent learning (referred to as multi-task learning). In this context, multiple supervised learning tasks, such as classification, are run in parallel, and information from each is used to speed learning. When the tasks themselves involve sequential decision making, as in reinforcement learning, prior work has focused on sharing information serially across consecutive related tasks, such as in transfer learning (e.g. Taylor and Stone 2009; Lazaric and Restelli 2011) or online learning across a set of tasks (Brunskill and Li 2013). Note that the multiagent literature considers multiple agents acting in a single environment, whereas we consider the different problem of one agent / decision maker simultaneously acting in multiple environments. The critical distinction here is that the actions and rewards taken in one task do not directly impact the actions and rewards taken in any other task (unlike multiagent settings), but information about the outcomes of these actions may provide useful information to other tasks, if the tasks are related. One important exception is recent work by Silver et al. (2013) on concurrent reinforcement learning when interacting with a set of customers in parallel. This work nicely demonstrates the substantial benefit to be had by leveraging information across related tasks while acting jointly in these tasks, using a simulator built from hundreds of thousands of customer records. However, that paper focused on an algorithmic and empirical contribution, and did not provide any formal analysis of the potential benefits of concurrent RL in terms of speeding up learning.
Towards a deeper understanding of concurrent RL and the potential advantages of sharing information when acting in parallel, we present new algorithms for concurrent reinforcement learning and provide a formal analysis of their properties. More precisely, we draw upon the literature on Probably Approximately Correct (PAC) RL (Kearns and Singh 2002; Brafman and Tennenholtz 2002; Kakade 2003), and bound the sample complexity of our approaches, which is the number of steps on which the agent may make a sub-optimal decision, with high probability. Interestingly, when all tasks are identical, we prove that simply by applying an existing state-of-the-art single-task PAC RL algorithm, MBIE (Strehl and Littman 2008), we can obtain, under mild conditions, a linear improvement in sample complexity compared to learning in each task with no shared information. We next consider a much more general situation, in which the presented tasks are sampled from a finite but unknown number of discrete state–action MDPs, and the identity of each task is unknown. Such scenarios arise in many applications in which an agent interacts with a group of people in parallel: for example, Lewis (2005) found that when constructing customer pricing policies for news delivery, customers were best modeled as being one of two (latent) types, each with distinct MDP parameters. We present a new algorithm for this setting and prove, under fairly general conditions, that if any two distinct MDPs differ in their model parameters by a minimum gap for at least one state–action pair, and the MDPs have finite diameter, then we can also obtain essentially a linear improvement in the sample complexity bounds across identical tasks. Our approach incurs no dominant overhead in sample complexity from having to cluster the tasks, implying that if all tasks are distinct, the resulting (theoretical) performance will be equivalent to performing single-task PAC RL in each task separately. These results provide an interesting counterpart to the sequential transfer work of Brunskill and Li (2013), which demonstrated that a reduction in sample complexity is possible if an agent performs a series of tasks drawn from a finite set of MDPs; in contrast to that work, which could only gain a benefit after completing many tasks, clustering them, and using that knowledge for learning in later tasks, we demonstrate that we can effectively cluster tasks and leverage this clustering during the reinforcement learning of those tasks to improve performance. We also provide small simulation experiments that support our theoretical results and demonstrate the advantage of carefully sharing information during concurrent reinforcement learning.

Background

A Markov decision process (MDP) is a tuple ⟨S, A, T, R, γ⟩, where S is a set of states, A is a set of actions, T is a transition model where T(s′|s, a) is the probability of transitioning to state s′ after taking action a in state s, R(s, a) ∈ [0, 1] is the expected reward received upon taking action a in state s, and γ is an (optional) discount factor. When it is clear from context we may use S and A to denote |S| and |A|, respectively. A policy π is a mapping from states to actions. The value V^π(s) of a policy π is the expected sum of discounted rewards obtained by following π starting in state s. We may write V(s) when the policy is clear from context. The optimal policy π* for an MDP is the one with the highest value function, denoted V*(s).
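For concreteness, the value of a policy π and the Bellman optimality equation satisfied by V* can be written out in the notation above. These are the standard definitions, stated here for reference rather than quoted from the paper:

```latex
V^{\pi}(s) \;=\; \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, R\big(s_t, \pi(s_t)\big) \;\middle|\; s_0 = s\right],
\qquad
V^{*}(s) \;=\; \max_{a \in A}\Big[\, R(s,a) \;+\; \gamma \sum_{s'} T(s' \mid s, a)\, V^{*}(s') \,\Big].
```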
In reinforcement learning (RL) the transition and reward models are initially unknown. Probably Approximately Correct (PAC) RL methods (Kearns and Singh 2002; Brafman and Tennenholtz 2002; Strehl, Li, and Littman 2006) guarantee that the number of steps on which the agent makes a less-than-ε-optimal decision, the sample complexity, is bounded by a polynomial function of the problem's parameters, with high probability. Sample complexity can be viewed as a measure of the learning speed of an algorithm, since it bounds the number of possible mistakes the algorithm will make. We will similarly use sample complexity to formally bound the potential speedup in learning gained by sharing experience across tasks.

Our work builds on MBIE, a single-task PAC RL algorithm (Strehl and Littman 2008). In MBIE the agent uses its experience to construct confidence intervals over its estimated transition and reward parameters. It computes a policy by performing repeated Bellman backups that are optimistic with respect to these confidence intervals, thereby constructing an optimistic MDP model, an optimistic estimate of the value function, and an optimistic policy (a minimal sketch of this optimistic backup is given in code below). This policy drives the agent towards rarely experienced state–action pairs or state–action pairs with high reward. We chose to build on MBIE due to its good sample complexity bounds and very good empirical performance. We believe it will be similarly possible to create concurrent algorithms and analyses building on other single-agent RL algorithms with strong performance guarantees, such as recent work by Lattimore, Hutter, and Sunehag (2013), but leave this direction for future work.

Concurrent RL in Identical Environments

We first consider a decision maker (a.k.a. agent) performing concurrent RL across a set of K MDP tasks. The model parameters of the MDP are unknown, but the agent does know that all K tasks are the same MDP. At time step t, each MDP k is in a particular state s_{t,k}. The decision maker then specifies an action for each MDP, a_1, …, a_K. The next state of each MDP is then generated given the stochastic dynamics model T(s′|s, a) of the MDP, and all the MDPs synchronously transition to their next state. This means the actual state (and reward) in each task at each time step will typically differ. In addition, there is no interaction between the tasks: imagine an agent coordinating the repair of many cars of identical make. The state of repair of one car does not impact the state of repair of another car.

We are interested in formally analyzing how sharing all information can impact learning speed. At best one might hope to gain a speedup in learning that scales exactly linearly with the number of MDPs K. Unfortunately such a speedup is not possible in all circumstances, due to the possibility of redundant exploration. For example, consider a small MDP where all the MDP copies start in the same initial state. One action transitions to a part of the state space with low rewards, and another action to a part with high rewards. It takes only a small number of tries of the bad action to learn that it is bad. However, in the concurrent setting, if there are very many MDPs, then the bad action will be tried much more than necessary, because the rest of the states have not yet been explored.
This potential redundant exploration is inherently due to the concurrent, synchronous, online nature of the problem, since the decision maker must assign an action to each MDP at each time step and cannot wait to see the outcomes of some decisions before assigning actions to the other MDPs. Interestingly, we now show that a trivial extension of the MBIE algorithm is sufficient to achieve a linear improvement in sample complexity for a very wide range of K, with no complicated mechanism needed to coordinate exploration across the MDPs. Our concurrent MBIE (CMBIE) algorithm uses the MBIE algorithm in its original form, except that we share the experience from all K agents. We now give a high-probability bound on the total sample complexity across all K MDPs. As the algorithm selects K actions at each time step, our sample complexity bounds the total number of non-optimal actions selected (not just the number of steps). Proofs, when not inline, are available in the appendix. We suspect it will be feasible to extend to asynchronous situations, but for clarity we focus on synchronous execution and leave asynchronous actions for future work.

[Figure 1: CMBIE experiments. (a) Skinny / filled thick / empty thick arrows yield reward 0.03 / 0.02 / 1 with probability 1 / 1 / 0.02. (b) Average reward per MDP per time step for CMBIE when running in 1, 5, or 10 copies of the same MDP (sliding-window average over 100 steps for readability). (c) Total cumulative reward per MDP after 10000 time steps versus the number of MDPs. (d) Total number of mistakes made after 10000 time steps versus the number of MDPs.]

Theorem 1. Given ε and δ, and K agents acting in identical copies of the same MDP, Concurrent MBIE (CMBIE) will select an ε-optimal action for all K agents on all but at most …
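To make the MBIE machinery described earlier concrete, here is a minimal sketch of the optimistic Bellman backup for a single state–action pair. The confidence-interval widths beta_r and beta_t are left as generic 1/√n-style parameters; the exact interval forms used by MBIE differ, and every name here is illustrative rather than taken from the paper:

```python
import numpy as np

def optimistic_backup(n_sa, r_sum, trans_counts, V, gamma, beta_r, beta_t, v_max):
    """One MBIE-style optimistic Bellman backup for a single (s, a) pair.

    n_sa         -- visit count N(s, a)
    r_sum        -- summed observed rewards for (s, a)
    trans_counts -- array of next-state counts N(s, a, s')
    V            -- current optimistic value estimates, one entry per state
    beta_r/beta_t-- reward / transition confidence-interval widths (assumed forms)
    v_max        -- upper bound on achievable value, used before any data is seen
    """
    if n_sa == 0:
        return v_max  # unvisited pairs stay maximally optimistic

    # Optimistic reward: empirical mean plus its confidence-interval width.
    r_opt = min(1.0, r_sum / n_sa + beta_r / np.sqrt(n_sa))

    # Optimistic transition: start from the empirical distribution and move up to
    # half the L1 budget of probability mass onto the highest-value next state,
    # taking that mass away from the lowest-value states.
    p_opt = trans_counts / n_sa
    eps_t = min(2.0, beta_t / np.sqrt(n_sa))
    best = np.argmax(V)
    add = min(eps_t / 2.0, 1.0 - p_opt[best])
    p_opt[best] += add
    for s_next in np.argsort(V):          # remove mass from the worst states first
        if add <= 0:
            break
        take = min(add, p_opt[s_next]) if s_next != best else 0.0
        p_opt[s_next] -= take
        add -= take

    return r_opt + gamma * (p_opt @ V)
```

Iterating this backup over all state–action pairs until convergence yields the optimistic value function, and acting greedily with respect to it gives the optimistic policy that MBIE follows.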

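The algorithmic change behind CMBIE is then small: run MBIE as usual, but let all K copies of the task read from and write to a single pool of experience. The sketch below assumes hypothetical env.reset()/env.step() interfaces and a plan_optimistic helper standing in for MBIE's optimistic planning (e.g. repeated backups as sketched above); it illustrates the pooling idea rather than the authors' implementation:

```python
import numpy as np

def run_cmbie(env_copies, num_states, num_actions, plan_optimistic, horizon):
    """Sketch of Concurrent MBIE: K synchronous copies of one unknown MDP share
    a single pool of experience. `plan_optimistic` is an assumed helper returning
    an optimistic Q table from the pooled counts; each env copy is assumed to
    expose reset() -> s and step(a) -> (s', r)."""
    K = len(env_copies)

    # Shared sufficient statistics across all K copies.
    n_sa = np.zeros((num_states, num_actions))
    r_sum = np.zeros((num_states, num_actions))
    n_sas = np.zeros((num_states, num_actions, num_states))

    states = [env.reset() for env in env_copies]
    for t in range(horizon):
        # One optimistic plan per step, computed from the pooled counts.
        Q = plan_optimistic(n_sa, r_sum, n_sas)
        for k, env in enumerate(env_copies):
            s = states[k]
            a = int(np.argmax(Q[s]))        # greedy w.r.t. the optimistic Q
            s_next, r = env.step(a)
            # Every copy writes into the same statistics: this pooling is the
            # only change relative to running MBIE separately in each copy.
            n_sa[s, a] += 1
            r_sum[s, a] += r
            n_sas[s, a, s_next] += 1
            states[k] = s_next
```

Running the same loop with K separate sets of statistics would recover K independent MBIE learners, which is the no-sharing baseline against which the linear speedup in Theorem 1 is measured.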
Publication date: 2015